Skip to content

[fix](parquet) Fix wrong encoding for parquet page v2#63305

Open
Gabriel39 wants to merge 3 commits into
apache:masterfrom
Gabriel39:fix_0515
Open

[fix](parquet) Fix wrong encoding for parquet page v2#63305
Gabriel39 wants to merge 3 commits into
apache:masterfrom
Gabriel39:fix_0515

Conversation

@Gabriel39
Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223
Copy link
Copy Markdown
Member

run external

@Gabriel39
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 31130 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 7a527c7d811d800a4d48236ef1f8fc93c38e1e3c, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17680	3913	3965	3913
q2	q3	10802	1448	814	814
q4	4700	474	342	342
q5	7542	2287	2105	2105
q6	250	182	140	140
q7	957	781	625	625
q8	9408	1714	1541	1541
q9	5182	4935	4907	4907
q10	6381	2080	1781	1781
q11	438	279	250	250
q12	632	429	305	305
q13	18121	3387	2785	2785
q14	265	261	236	236
q15	q16	822	784	713	713
q17	940	994	1047	994
q18	7026	5640	5593	5593
q19	1326	1346	1081	1081
q20	511	398	260	260
q21	5881	2612	2436	2436
q22	438	368	309	309
Total cold run time: 99302 ms
Total hot run time: 31130 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4322	4275	4283	4275
q2	q3	4551	4913	4318	4318
q4	2149	2209	1432	1432
q5	4399	4322	4328	4322
q6	237	186	137	137
q7	2212	1963	1621	1621
q8	2783	2279	2205	2205
q9	8029	8039	7858	7858
q10	4555	4497	3996	3996
q11	595	432	391	391
q12	777	730	533	533
q13	3375	3624	3010	3010
q14	290	291	282	282
q15	q16	736	736	652	652
q17	1404	1343	1401	1343
q18	7809	7384	7301	7301
q19	1207	1153	1161	1153
q20	2219	2219	1959	1959
q21	5403	4667	4579	4579
q22	523	479	422	422
Total cold run time: 57575 ms
Total hot run time: 51789 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 169421 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 7a527c7d811d800a4d48236ef1f8fc93c38e1e3c, data reload: false

query5	4322	650	514	514
query6	342	217	200	200
query7	4262	573	303	303
query8	336	235	221	221
query9	8818	3992	3977	3977
query10	442	346	289	289
query11	5819	2366	2215	2215
query12	193	137	125	125
query13	1273	596	444	444
query14	6014	5382	5088	5088
query14_1	4389	4417	4350	4350
query15	211	205	184	184
query16	997	486	451	451
query17	1171	762	619	619
query18	2768	477	345	345
query19	226	194	155	155
query20	139	131	131	131
query21	216	143	119	119
query22	13602	13666	13310	13310
query23	17205	16321	16088	16088
query23_1	16126	16167	16250	16167
query24	7607	1769	1288	1288
query24_1	1307	1317	1291	1291
query25	546	474	427	427
query26	1321	331	177	177
query27	2638	558	338	338
query28	4440	1938	1923	1923
query29	981	627	504	504
query30	313	243	197	197
query31	1109	1079	942	942
query32	87	77	72	72
query33	518	356	290	290
query34	1169	1125	635	635
query35	791	783	685	685
query36	1302	1349	1221	1221
query37	156	112	93	93
query38	3218	3137	3083	3083
query39	953	940	903	903
query39_1	870	903	891	891
query40	228	145	126	126
query41	67	65	65	65
query42	109	110	108	108
query43	321	325	296	296
query44	
query45	214	209	197	197
query46	1093	1187	725	725
query47	2295	2416	2231	2231
query48	411	406	294	294
query49	634	502	386	386
query50	929	363	254	254
query51	4348	4301	4254	4254
query52	110	106	93	93
query53	253	282	210	210
query54	323	286	272	272
query55	96	93	85	85
query56	311	311	310	310
query57	1403	1410	1486	1410
query58	310	272	266	266
query59	1613	1617	1431	1431
query60	334	332	352	332
query61	160	161	159	159
query62	663	641	552	552
query63	250	201	204	201
query64	2383	826	639	639
query65	
query66	1669	511	360	360
query67	30086	30018	29792	29792
query68	
query69	459	341	308	308
query70	1054	993	981	981
query71	300	279	276	276
query72	3021	2701	2420	2420
query73	839	757	443	443
query74	5087	4924	4775	4775
query75	2702	2641	2277	2277
query76	2325	1137	736	736
query77	410	428	358	358
query78	12122	12098	11581	11581
query79	1276	1031	736	736
query80	646	585	499	499
query81	451	286	259	259
query82	247	160	129	129
query83	282	290	264	264
query84	277	149	115	115
query85	945	635	616	616
query86	366	360	321	321
query87	3382	3404	3233	3233
query88	3489	2659	2662	2659
query89	429	384	340	340
query90	2201	181	186	181
query91	177	171	139	139
query92	81	81	72	72
query93	1479	1461	936	936
query94	522	348	312	312
query95	662	385	352	352
query96	1080	778	355	355
query97	2684	2743	2549	2549
query98	236	227	228	227
query99	1106	1109	1003	1003
Total cold run time: 252108 ms
Total hot run time: 169421 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 100.00% (2/2) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.49% (20640/38587)
Line Coverage 37.15% (195074/525072)
Region Coverage 33.54% (152740/455408)
Branch Coverage 34.56% (66570/192620)

Gabriel39 added 2 commits May 18, 2026 11:10
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add a BE unit test for rejecting Parquet DATA_PAGE_V2 chunks that contain non-dictionary data page encodings in encoding_stats.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Added BE UT for RowGroupReader::is_dictionary_encoded DATA_PAGE_V2 encoding_stats handling. Attempted ./run-be-ut.sh --run --filter=ParquetThriftReaderTest.is_dictionary_encoded_rejects_plain_data_page_v2, but local environment has JAVA_HOME on JDK 11 and run-be-ut.sh requires JDK 17. Attempted build-support/clang-format.sh, but local environment lacks llvm@16.
- Behavior changed: No
- Does this need documentation: No
@Gabriel39
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 100.00% (2/2) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.49% (20643/38589)
Line Coverage 37.16% (195106/525077)
Region Coverage 33.53% (152721/455409)
Branch Coverage 34.57% (66585/192620)

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 31243 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 7939d15874b575d3c8541761d127fc529c7a52d4, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17698	4029	3999	3999
q2	q3	10813	1460	827	827
q4	4685	479	351	351
q5	7586	2295	2141	2141
q6	238	182	144	144
q7	1016	777	654	654
q8	9720	1789	1560	1560
q9	5146	4908	4893	4893
q10	6323	2078	1784	1784
q11	451	288	253	253
q12	636	428	310	310
q13	18165	3281	2778	2778
q14	264	259	242	242
q15	q16	822	797	706	706
q17	1000	1007	966	966
q18	7022	5836	5510	5510
q19	1304	1296	1060	1060
q20	551	422	271	271
q21	6050	2636	2478	2478
q22	436	370	316	316
Total cold run time: 99926 ms
Total hot run time: 31243 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4223	4175	4335	4175
q2	q3	4496	4888	4314	4314
q4	2161	2245	1411	1411
q5	4450	4301	4352	4301
q6	238	185	139	139
q7	2410	1968	1750	1750
q8	2635	2345	2251	2251
q9	8210	7835	7873	7835
q10	4504	4469	4075	4075
q11	574	474	443	443
q12	790	782	542	542
q13	3298	3678	2909	2909
q14	304	301	281	281
q15	q16	720	751	640	640
q17	1383	1348	1408	1348
q18	8201	7199	7329	7199
q19	1188	1172	1118	1118
q20	2230	2227	1930	1930
q21	5506	4705	4559	4559
q22	524	473	443	443
Total cold run time: 58045 ms
Total hot run time: 51663 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 168762 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 7939d15874b575d3c8541761d127fc529c7a52d4, data reload: false

query5	4327	658	525	525
query6	324	221	239	221
query7	4209	548	321	321
query8	335	242	243	242
query9	8818	4073	4067	4067
query10	463	353	305	305
query11	5780	2416	2211	2211
query12	185	130	132	130
query13	1340	606	448	448
query14	6077	5379	5074	5074
query14_1	4393	4363	4390	4363
query15	215	208	186	186
query16	991	451	422	422
query17	981	719	595	595
query18	2495	496	368	368
query19	225	201	163	163
query20	135	134	133	133
query21	215	148	118	118
query22	13573	13472	13313	13313
query23	17228	16413	15925	15925
query23_1	16259	16163	16190	16163
query24	7578	1755	1334	1334
query24_1	1357	1296	1321	1296
query25	610	500	428	428
query26	1346	328	170	170
query27	2664	596	352	352
query28	4517	1959	1953	1953
query29	995	647	524	524
query30	313	240	207	207
query31	1118	1088	949	949
query32	93	81	76	76
query33	570	376	325	325
query34	1221	1140	643	643
query35	765	770	679	679
query36	1332	1329	1130	1130
query37	157	105	90	90
query38	3246	3206	3059	3059
query39	927	922	940	922
query39_1	879	889	862	862
query40	223	151	123	123
query41	67	64	63	63
query42	114	110	110	110
query43	325	326	298	298
query44	
query45	214	201	192	192
query46	1074	1247	722	722
query47	2278	2379	2195	2195
query48	396	416	289	289
query49	631	496	396	396
query50	1061	343	254	254
query51	4310	4293	4210	4210
query52	104	108	94	94
query53	259	291	215	215
query54	316	269	259	259
query55	94	92	84	84
query56	301	307	316	307
query57	1387	1393	1280	1280
query58	299	273	261	261
query59	1539	1651	1418	1418
query60	329	327	300	300
query61	164	160	157	157
query62	667	618	557	557
query63	251	200	209	200
query64	2406	816	635	635
query65	
query66	1716	483	407	407
query67	29998	29925	29146	29146
query68	
query69	450	351	299	299
query70	1043	1020	979	979
query71	303	273	277	273
query72	2985	2501	2456	2456
query73	815	799	431	431
query74	5091	4947	4751	4751
query75	2685	2602	2263	2263
query76	2284	1151	780	780
query77	404	437	330	330
query78	12108	12193	11597	11597
query79	1426	1048	724	724
query80	653	539	454	454
query81	452	283	239	239
query82	1383	161	127	127
query83	363	273	258	258
query84	302	138	115	115
query85	872	541	455	455
query86	391	346	349	346
query87	3397	3409	3240	3240
query88	3533	2681	2654	2654
query89	437	388	339	339
query90	1983	189	183	183
query91	181	169	138	138
query92	82	82	77	77
query93	1483	1559	843	843
query94	527	346	324	324
query95	683	479	351	351
query96	1003	804	332	332
query97	2708	2706	2550	2550
query98	254	225	236	225
query99	1154	1090	967	967
Total cold run time: 253126 ms
Total hot run time: 168762 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (2/2) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.77% (27121/37789)
Line Coverage 55.12% (288662/523700)
Region Coverage 52.25% (240241/459824)
Branch Coverage 53.49% (103429/193350)

1 similar comment
@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (2/2) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.77% (27121/37789)
Line Coverage 55.12% (288662/523700)
Region Coverage 52.25% (240241/459824)
Branch Coverage 53.49% (103429/193350)

@Gabriel39
Copy link
Copy Markdown
Contributor Author

/review

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label May 18, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Copy Markdown
Contributor

PR approved by anyone and no changes requested.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review result: request changes.

Critical checkpoint conclusions:

  • Goal/test proof: The production change aims to reject non-dictionary DATA_PAGE_V2 pages for Parquet dictionary filtering, and the main code change is directionally correct, but the added BE unit test does not compile because it calls a private method directly.
  • Scope/focus: The production change is small and focused. The test access pattern needs to be corrected without broadening behavior unnecessarily.
  • Concurrency/lifecycle/config/compatibility: No new concurrency, special lifecycle, configuration, storage format, or FE-BE protocol compatibility concerns found.
  • Parallel code paths: There is a similar Iceberg dictionary-encoding helper/test path that already covers DATA_PAGE_V2; no additional blocking issue found beyond the new Parquet test compile failure.
  • Special checks/error handling/memory/transactions/data writes: No new concerns found in the changed production logic.
  • Tests/results: Regression coverage was added, but BE unit coverage is currently broken at compile time, so it cannot validate the fix until the private-access issue is resolved.
  • Observability/performance: No additional observability or performance issue found for this small metadata check.

User focus: No additional user-provided review focus was present.

position_delete_ctx, lazy_read_ctx, nullptr, column_ids,
filter_column_ids);

EXPECT_FALSE(row_group_reader.is_dictionary_encoded(column_metadata));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test will not compile as written: RowGroupReader::is_dictionary_encoded is declared in the private section of vparquet_group_reader.h, and this test file does not use a friend/test helper or a private visibility override. Because the new BE UT directly calls row_group_reader.is_dictionary_encoded(...), the target fails at compile time instead of validating the DATA_PAGE_V2 fix. Please expose this logic through a testable helper (or test through the public reader path) before adding the assertion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/3.0.x dev/3.1.x dev/4.0.x dev/4.1.x reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants